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Gene-gene interactions liave long been recognized to be funda- 
mentally important to understand genetic causes of complex dis- 
ease traits. At present, identifying gene-gene interactions from 
genome-wide case-control studies is computationally and method- 
ologically challenging. In this paper, we introduce a simple but 
powerful method, named 'BOolean Operation based Screening 
and Testing'(BOOST). To discover unknown gene-gene interac- 
tions that underlie complex diseases, BOOST allows examining 
all pairwise interactions in genome- wide case-control studies in a 
remarkably fast manner. We have carried out interaction anal- 
yses on seven data sets from the Wellcome Trust Case Control 
Consortium (WTCCC). Each analysis took less than 60 hours on 
a standard 3.0 GHz desktop with 4G memory running Windows 
XP system. The interaction patterns identified from the type 1 di- 
abetes data set display significant difference from those identified 
from the rheumatoid arthritis data set, while both data sets share 
a very similar hit region in the WTCCC report. BOOST has also 
identified many undiscovered interactions between genes in the 
major histocompatibility complex (MHC) region in the type 1 di- 
abetes data set. In the coming era of large-scale interaction map- 
ping in genome-wide case-control studies, our method can serve 
as a computationally and statistically useful tool. 

Genome-wide case-control studies use high-throughput genotyping 
technologies to assay hundreds of thousands of single-nucleotide 
polymorphisms (SNPs) and relate them to clinical conditions or mea- 
surable traits. To understand underlying causes of complex disease 
traits, it is often necessary to consider joint genetic effects (Epistasis) 
across the whole genome. The concept of epistasis iQ] was introduced 
around 100 years ago. It is generally defined as interactions among 
different genes. The existence of epistasis has been widely accepted 
as an important contributor to genetic variation in complex diseases 
such as asthma, cancer, diabetes, hypertension, and obesity 1^. As a 
matter of fact, most researchers believe that it is critical to model com- 
plex interactions to elucidate the joint genetic effects causing complex 
diseases. They have demonstrated the presence of gen e-gene interac- 
tions in complex diseases, such as breast cancer lll4ll and coronary 
heart disease ifisll . 

The problem of detecting gene-gene interactions in genome-wide 
case-control studies has attracted extensive research interest. The dif- 
ficulty of this problem is the heavy computational burden. For ex- 
ample, in order to detect pairwise interactions from 500, 000 SNPs 
genotyped in thousands of samples, we need 1.25 x 10^^ statistical 
tests in total. A recent review |@| presented a detailed analysis on 
many popular methods, including PLINK 0], MDR jT3], Tuning Re- 
liefF (ij, Random JunglfUi.e., Random Forest d) and BEAM fivll . 
Among them, PLINK was recommended as the most computation- 
ally feasible method that is able to detect gene-gene interactions in 



genome-wide data0 Q]. PLINK finished pairwise interaction exam- 
ination of 89,294 SNPs selected from the WTCCC Crohn's disease 
data set in 14 days. 

Here we propose a new method, named 'BOolean Operation based 
Screening and Testing' (BOOST), to analyze all pairwise interactions 
in genome-wide SNP data. In our method, we use a boolean repre- 
sentation of genotype data which is designed to be CPU efficient for 
multi-locus analysis. Based on this data representation, we design 
a two-stage (screening and testing) search method. In the screening 
stage, we use a non-iterative method to approximate the likelihood ra- 
tio statistic in evaluating all pairs of SNPs and select those passing a 
specified threshold. Most non-significant interactions will be filtered 
out and the survival of significant interactions is guaranteed. In the 
testing stage, we employ the typical likelihood ratio test to measure 
the interaction effects of selected SNP pairs. 
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Table 1: The number of interactions identified from seven dis- 
eases data set under different constraints. - significant thresh- 
old constraint: The significance threshold is 0.05 for the Bonferroni- 
corrected interaction P- value; - distance constraint: The physical 
distance between two interacting SNPs is at least 1Mb. This constraint 
is used to avoid interactions that might be attributed to the linkage dis- 
equilibrium (LD) effects Q]; - main effect constraint: The single- 
locus P- value should not be less than lO^*". This constraint is used 
to see whether there exist strong interactions without significant main 
effects because those SNPs with P > lO^'' are usually filtered out in 
the typical single-locus scan. 

Results 

We have applied BOOST to analyze data (14,000 cases in total and 
3,000 shared controls) from the Wellcome Trust Case Control Con- 
sortium (WTCCC) on seven common human diseases: bipolar dis- 
order (BD), coronary artery disease (CAD), Crohn's disease (CD), 
hypertension (HT), rheumatoid arthritis (RA), type 1 diabetes (TID) 
and type 2 diabetes (T2D). The analysis of each disease data set with 
control samples took less than 60 hours (around 2.5 days) to com- 
pletely evaluate all pairs of roughly 360, 000 SNPsQ on a standard 
3.0 GHz desktop with 4G memory running Windows XP system. The 
results under different constraints are reported in Table[T] For TID, 
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" Marchini et al. Ildl demonstrated that it is feasible to test association allowing 
for interactions in genome-wide scale. Beside that, Random Jungle can handle genome- 
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we discovered many gene-gene interactions in tlie MHC region (see 
detailed descriptions in the following section). For other six diseases, 
however, we did not find nontrivial interactions (except one interact- 
ing SNP pair in CD). 

TID & RA. The MHC region in chromosome 6 has long been inves- 
tigated as the most variable region in the human genome with respect 
to infection, inflammation, autoimmunity and transplant medicine 
The recent study conducted by WTCCC |15] has shown that both 
TID and RA are strongly associated with the MHC region via single- 
locus association mapping. The top-left panel of Figure [T] shows that 
the single-locus association map does not reveal much difference be- 
tween TID and RA. In our study, BOOST reports 4499 interactions 
in the TID data set (see Table[TJ, in which 4489 interactions (99.8%) 
are in the MHC region. As comparison, BOOST reports 350 inter- 
actions in the RA data set, in which 280 interactions (80.0%) are in 
the MHC region. Our genome-wide interaction map provides the ev- 
idence that the MHC region is associated with these two diseases in 
different ways. The bottom panel of Figure [T] gives detailed interac- 
tion maps in the MHC region for TID and RA data. The LD map|3 
of MHC region is provided in the top-right panel of Figure [T] These 
interaction maps, different from the LD map, reveal a distinct pattern 
difference between TID and RA. Specifically, there are three sub- 
regions in the MHC region, namely, the MHC class I region (29.8Mb - 
31.6Mb), the MHC class III region (31.6Mb - 32.3Mb) and the MHC 
class II region (32.3Mb - 33.4Mb). A closer inspection of the TID 
interaction map indicates that strong interaction effects widely exist 
between genes within and cross three classes, while most significant 
interactions in RA only involve loci closely placed in the MHC class 
II region. The contrast of the interaction patterns between TID and 
RA may explain their different aetiologies, which are not revealed by 
single-locus association mapping. 

Interactions without significant main effects detected in TID. The 

MHC region is a highly polymorphic region with a high gene density. 
Although previous reports |0, fol using the single-locus scan have 
identified strong associations between MHC genes (such as HLA- 
DQBl and HLA-DRBl) and TID, it is still unclear which and how 
many loci within the MHC region determine TID susceptibility. In- 
teractions without significant main effects can provide additional in- 
formation to help pinpoint disease-associated loci because SNPs in- 
volved in those interactions are usually filtered out in the single-locus 
scan. Among the selected 789 interacting pairs in TID, 91 pairs have 
non-significant loci under the single-locus scan (all of them are listed 
in the supplementary). A careful inspection of these 91 interactions 
has identified two interesting interaction patterns between the MHC 
Class I and Class II. Figure[2]presents one interaction pattern between 
the region 31350k-31390k and the region 32810k-32860k in chromo- 
some 6. Another pattern between the region 31350k-31390k and the 
region 32930k - 32960k in chromosome 6 is provided in the supple- 
mentary. The interactions between two regions in Figure [2] are listed 
in Table |2] All SNPs in these interactions display weak main effects 
while their joint effects are statistically significant. As Nejentsev et 
al. 0] argued that both the MHC class I and II genes should be con- 
sidered to better understand type I diabetes susceptibility, our results 
further provide the evidence that the interaction effects between these 
two classes may contribute to the aetiology of type 1 diabetes. 

COMPUTATION TIME 

From a practical point of view, a key issue of detecting gene-gene in- 
teractions in genome-wide case-control studies is the computational 
efficiency Cordell (sj] reported that PLINK took about 14 days 
to test pairwise interactions of the selected 89,294 SNPs on a sin- 
gle node of a computer cluster. A rough estimation implies that it 
would take approximately 228 day^ to test all pairwise interactions 
of 360, 000 SNPs. Random Jungle took about 5 hours to handle the 

■^We calculate composite LD using the method by Zaykin et al. Ilfil . 

^Given the number of samples, the running time of PLINK is proportional to the 
square of the number of SNPs. Therefore, the rough estimation is calculated by 
14 X (360,000/89,294)^ » 228. 
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Table 2: The interaction SNP pairs in the two regions shown in Figure 
[2] The SNPs in the column 'SNP 1' reside in the gene HLA-B and 
The SNPs in the column 'SNP 2' locate at the block across the genes 
HLA-DQA2 and HLA-DQB2. They show strong interactions without 
displaying significant main effects. 

selected 89,294 SNPs genotyped in 5000 samples. It is unknown 
how much time Random Jungle will need to find interactions from 
360, 000 SNPs. However, Random Jungle aims at detecting associ- 
ation allowing for interactions rather than detecting interactions (see 
detailed explanations in the supplementary). Besides, Random Jun- 
gle has the difficulty of finding interacting SNP pairs displaying weak 
main effects because trees built in Random Jungle are conditional on 
the main effects of SNPs. BEAM took about 8 days to handle 47,727 
SNPs using 5x10^ Markov chain Monte Carlo iterations. Currently, 
BEAM fails to handle 500,000 to 1,000,000 SNPs genotyped in 
5000 or more samples. Cordell (s)] recommended PLINK and Ran- 
dom Jungle as two most computationally feasible methods. 

Our method BOOST makes a tremendous progress. It evaluated all 
pairs of roughly 360, 000 SNPs within 60 hours (around 2.5 days) on 
a standard desktop (3.0 GHz CPU with 4G memory running Windows 
XP professional x64 Edition system). The WTCCC phase 2 study 
will analyze over 60,000 samples of various diseases using either the 
Affymetrix v6.0 chip or the Illumina 660K chip. The shared control 
samples will increase from 3, 000 to 6, 000. Such an increase in num- 
bers of SNPs and sample size are more demanding on the computation 
efficiency. We anticipate that BOOST is still applicable to analyze the 
new data sets. 

CONCLUSION 

The large number of SNPs genotyped in genome-wide case-control 
studies poses a great computational challenge in identification of 
gene-gene interactions. During the last few years, there have been fast 
growing interests in developing and applying computational and sta- 
tistical approaches to finding gene-gene interactions. However, many 
approaches fail to handle genome-wide data sets (e.g., 500, 000 SNPs 
and 5, 000 samples). This hampers identification of interactions in 
genome- wide case-control studies. In this paper, we present a method 
named 'BOOST' to address this problem. We have successfully ap- 
plied our method to analyze seven data sets from WTCCC. Not only 
is BOOST computationally efficient, it also has the advantage over 
PLINK with respect to the statistical power (see simulation study in 
the supplementary). Our experiment results demonstrate that inter- 
action mapping is both computationally and statistically feasible for 
hundreds of thousands of SNPs genotyped in thousands of samples. 

METHODS 

Boolean representation of genotype data. Suppose we have C SNPs 
and n samples. The data set is usually stored in an £ x n matrix. 
Each cell in this matrix takes a value in {1, 2, 3}, which represent ho- 
mozygous reference genotype, heterozygous genotype and homozy- 
gous variant genotype, respectively. In our method, we introduce a 
Boolean representation of genotype data (the details are provided in 
the supplementary). This Boolean representation promotes not only 
the space efficiency but also the CPU efficiency because it only in- 
volves Boolean values and thus allows using fast logic (bitwise) oper- 
ations to obtain contingency tables. 

Measuring interaction effects. Logistic regression models are of- 
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Figure 1: Top-left panel: Single-locus association mapping of TID and RA. They share a very similar hit region in chromosome 6. Top- 
right panel: The LD map of the MHC region in control samples. Bottom panel: Genome-wide interaction mapping of TID and RA. 99.8% 
interactions of TID and 80.0% interactions of RA are in the MHC region. Strong interaction effects widely exist between genes in and across 
the MHC class I, II and III in TID, while most significant interactions of RA only involve loci closely placed in the MHC class II region (The 
P values are truncated at P = 1.0 x 10~"). 
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Figure 2: Interactions between the genes in the MHC class I and those in the MHC class II. Left panel (a): The region 31350k - 31390k 
of Chromosome 6. The gene HLA-B in the MHC class I locates in this region. The recombination rate and LD plot from HapMap show 
that a block structure spans from 31360k to 31380k. This region is mapped through the SNPs rs2524057, rs2853934, rs2524115, rs396038, 
rs3873385, rs2524095, and rs2524089. The SNPs rs2524095 and rs2524089 are involved in the interactions with the region 32930k - 32960k 
shown in Figure 1(b) of the supplementary. Right panel (b): The region 32810k - 32860k of Chromosome 6. The genes HLA-DQA2 and 
HLA-DQB2 in the MHC class II reside in this region. The recombination rate and LD plot from HapMap show that a block structure exists 
from 32820k to 32847k. This region is mapped through the genotyped SNPs rs9276448, rs5014418, and rs6919798. The ungenotyped SNPs 
rs9276438 and rs7774954 reside in the genes HLA-DQA2 and HLA-DQB2, respectively. They are in strong LD with those genotyped SNPs. 
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ten used to measure the strength of gene-gene interactions flSl- Let 
Lm and _Lf be the log likelihood of the logistic regression model 
Mm with only main effect terms (df = 4) and the full model Mf 
with both main effect terms and interaction terms {df = 8), respec- 
tively. An interaction effect is measured by the difference of the log 
likelihood evaluated at maximum likelihood estimation (MLE), i.e., 
Lf — Lm. The likelihood ratio statistic 2{Lf — Lm) is asymptot- 
ically distributed with df = 4. However, it is computationally 
unaffordable to directly use this measure to evaluate all pairs of SNPs 
in a genome-wide case-control study because there are hundreds of 
billions of pairs to be tested. 

Noticing the correspondence between a logistic regression model 
and a log-linear model in categorical data analysis {ill, we are able to 
measure an interaction effect under log-linear models. In the space 
of log-linear models, the homogenous model Mh is the equivalent 
form of the main effect model Mm and the saturated model Ms is 
the equivalent form of the full model Mf- Let Lh and Ls be the 
log-likelihood of Mh and Ms evaluated at their MLEs, respectively. 
Interaction effects can thus be measured using Ls — Lh- After some 
algebra, it turns out that Ls — Lh is connected with the Kullback- 
Leibler divergence 

Ls - Lh = n ■ DKL{f^\\t>), (1) 

where n is the number of samples, tt is the joint distribution estimated 
under Ms and p is the joint distribution estimated under Mh- More 
details can be found in the supplementary. 

Boolean operation based screening and testing. In Eq.Q, there 
is no closed-form solution for p under the model Mh- Using iter- 
ative methods to estimate p is computationally intensive to test all 
SNP pairs in genome- wide case-control studies. Here we use a non- 
iterative method known as 'Kirkwood Superposition Approximation' 
(KSA) Ulh to approximate p, denoted as p^. Let Lksa be the log- 
likelihood of KSA model evaluated at its MLE. We show that 

Ls — Lh < Ls — Lksa- (2) 

In other words, Ls — Lksa ~ n ■ Z)/fL(7r||p^) is an upper bound 
of Ls — Lh (please check the supplementary for details). Based on 
this upper bound, we propose our new method 'BOolean Operation 
based Screening and Testing' (BOOST): 

• Stage 1 (Screening): we evaluate all pairwise interactions by 
using KSA in the screening stage. For each pair, the calcu- 
lation of 2(Ls — Lksa) = 2n ■ Dkl(7T'||p^^ ) is based on 
the contingency table collected by using Boolean operations. 
Since 2{Ls — Lh) < 2(Z/s — Lksa), an interaction ob- 
tained by KSA without passing a specified threshold r, i.e., 
2{Ls — Lksa) < t, would not be considered in Stage 2. We 
set the threshold r = 30 in our experiment|f| 

• Stage 2 (Testing): For each pair with 2{Ls — Lksa) > t, 
we test the interaction effect using the likelihood ratio statistic 
2{Ls — Lh)- We fit the log-linear models Mh and Ms, and 
calculate this test statistic using Eq. l[T). After that, we conduct 
the test with df = 4 to determine whether the interaction 
effect is significant. 
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which is a very weak significant level for a genome-wide study. For the WTCCC data, the 
number of interactions passing this threshold ranges from ~ 300, 000 to ~ 600, 000. 
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